Improving Lower Bounds for the Quadratic Assignment 
Problem by applying a Distributed Dual Ascent Algorithm 

A. D. Gonçalves^a, L. M. A. Drummond^a,*, A. A. Pessoa^b, P. M. Hahn^c

^a Computer Science Department, Fluminense Federal University, Niteroi, RJ, Brazil

^b Production Engineering Department, Fluminense Federal University, Niteroi, RJ, Brazil

^c Electrical and Systems Engineering, The University of Pennsylvania, Philadelphia, PA 19104-6315, USA



Abstract

The application of the Reformulation Linearization Technique (RLT) to the Quadratic Assignment Problem (QAP) leads to a tight linear relaxation with huge dimensions that is hard to solve. Previous works found in the literature show that these relaxations combined with branch-and-bound algorithms belong to the state of the art of exact methods for the QAP. For the level 3 RLT (RLT3), using this relaxation is prohibitive in conventional machines for instances with more than 22 locations due to memory limitations. This paper presents a distributed version of a dual ascent algorithm for the RLT3 QAP relaxation that approximately solves it for instances with up to 30 locations for the first time. Although the distributed algorithm has basically been implemented on top of its sequential counterpart, some changes, which improved not only the parallel performance but also the quality of solutions, were proposed here. When compared to other lower bounding methods found in the literature, our algorithm generates the best known lower bounds for 26 out of the 28 tested instances, reaching the optimal solution in 18 of them.

1. Introduction

Given $N$ objects, $N$ locations, a flow $f_{ik}$ from each object $i$ to each object $k$, $k \neq i$, and a distance $d_{jn}$ from each location $j$ to each location $n$, $n \neq j$, the quadratic assignment problem (QAP) consists of assigning each object $i$ to exactly one location $j$. We wish to find:

$$\min \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{\substack{k=1 \\ k \neq i}}^{N} \sum_{\substack{n=1 \\ n \neq j}}^{N} f_{ik} \, d_{jn} \, x_{ij} \, x_{kn} \; : \; x \in X, \; x \in \{0,1\} \qquad (1)$$
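As a small illustration of objective (1), the sketch below evaluates the cost of an assignment and finds the optimum of a tiny instance by enumeration. It assumes that a feasible $x$ can be encoded as a permutation (`assign[i]` is the location of object `i`); the function names and the 3-object data are invented for this example and are not taken from the paper.

```python
from itertools import permutations


def qap_cost(flow, dist, assign):
    """Objective (1) for a permutation: sum of f_ik * d_jn over i and k != i,
    where j = assign[i] and n = assign[k]."""
    n = len(assign)
    return sum(
        flow[i][k] * dist[assign[i]][assign[k]]
        for i in range(n)
        for k in range(n)
        if k != i
    )


def brute_force_qap(flow, dist):
    """Enumerate all N! assignments; only practical for very small N."""
    n = len(flow)
    best = min(permutations(range(n)), key=lambda p: qap_cost(flow, dist, p))
    return qap_cost(flow, dist, best), best


# Hypothetical 3-object instance (values invented for illustration).
FLOW = [[0, 5, 2], [5, 0, 3], [2, 3, 0]]
DIST = [[0, 1, 4], [1, 0, 2], [4, 2, 0]]

if __name__ == "__main__":
    print(brute_force_qap(FLOW, DIST))
```

Enumeration grows as $N!$, which is why the lower bounds discussed next matter for anything beyond toy sizes.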
Initially presented by Koopmans & Beckmann (1957), the QAP has practical applications in several areas, such as facility layout, electronic circuit board design, construction planning, etc. The QAP is one of the most difficult and most studied combinatorial optimization problems found in the OR literature. Usually, difficult instances require a great deal of computational effort to be solved exactly. In Adams et al. (2007), for example, a 30-location instance is solved on a single CPU of a Dell 7150 PowerEdge server in 1,848 days. Thus, good lower bounds are crucial for solving instances with more than 15 locations in reasonable processing time. They allow a large number of alternative solutions to be discarded during the search for the optimal solution in the branch-and-bound tree.



A summary of the techniques used for calculating lower bounds is presented in Loiola et al. (2007). In the QAPLIB website, Burkard et al. (1991), a table showing lower bounds for each instance of the site is presented. The best lower bounds were achieved by Burer & Vandenbussche (2006), Adams et al. (2007) and Hahn et al. (2012). The dual ascent algorithm based on the RLT3 formulation, described in Hahn et al. (2012), calculates tight lower bounds, but the use of this technique in conventional machines for instances with more than 25 locations is impossible due to its large memory requirements. For example, to solve an instance of 25 locations, Hahn, in Hahn et al. (2012), used a host with 173 GB of shared memory. Recently, a very difficult instance with 30 locations has been solved exactly also using the RLT3 formulation (see http://www.seas.upenn.edu/qaplib/news.html). In this case, the authors used the same cluster of machines, which contains hosts with up to 2 TB of shared memory.

The contribution of this paper is the proposal of a distributed application developed on top of the sequential algorithm proposed in Hahn et al. (2012), but not equivalent to it, since our new algorithm has some important changes, which improve not only the performance but also the quality of RLT3 lower bounds for some instances. This distributed algorithm executes on a conventional cluster of computers and generates the best known lower bounds for 26 out of the 28 tested instances, reaching the optimal solution in 18 of them.



2. Reformulation-linearization technique applied to the QAP 

The reformulation-linearization technique was initially developed by Adams & Sherali (1986), aiming to generate tight linear programming relaxations for discrete and continuous nonconvex problems. For mixed zero-one programs involving $m$ binary variables, RLT establishes an $m$-level hierarchy of relaxations spanning from the ordinary linear programming relaxation to the convex hull of feasible integer solutions. For a given $z \in \{1, \ldots, m\}$, the level-$z$ RLT, or RLT$z$, constructs various polynomial factors of degree $z$ consisting of the product of some $z$ binary variables $x_j$ or their complements $(1 - x_j)$.

We find in the literature various RLT levels applied to the QAP: RLT1 in Hahn & Grant (1998), RLT2 in Adams et al. (2007) and RLT3 in Hahn et al. (2012). The RLT consists of two steps: the reformulation and the linearization.



The RLT3 reformulation, presented in Hahn et al. (2012), consists of the following steps: (i) multiply each of the $2N$ assignment constraints by each of the $N^2$ binary variables $x_{ij}$ (applying RLT1); (ii) multiply each of the $2N$ assignment constraints by each one of the $N^2(N-1)^2$ products $x_{ij}x_{kn}$, $k \neq i$ and $n \neq j$ (applying RLT2); (iii) multiply each of the $2N$ assignment constraints by each one of the $N^2(N-1)^2(N-2)^2$ products $x_{ij}x_{kn}x_{pq}$, $p \neq k \neq i$ and $q \neq n \neq j$ (applying RLT3). Moreover, remove the products $x_{ij}x_{kn}$ if ($k = i$ and $n \neq j$) or ($k \neq i$ and $n = j$) in quadratic expressions; remove all products $x_{ij}x_{kn}x_{pq}$ if ($p = i$ and $q \neq j$), ($p = k$ and $q \neq n$), ($p \neq i$ and $q = j$) or ($p \neq k$ and $q = n$) in cubic expressions; and, finally, remove all products $x_{ij}x_{kn}x_{pq}x_{gh}$ if ($g = i$ and $h \neq j$), ($g = k$ and $h \neq n$), ($g = p$ and $h \neq q$), ($g \neq i$ and $h = j$), ($g \neq k$ and $h = n$) or ($g \neq p$ and $h = q$) in biquadratic expressions.

The linearization consists of: (i) replace each product $x_{ij}x_{kn}$, with $i \neq k$ and $j \neq n$, by the continuous variable $y_{ijkn}$, imposing the constraints $y_{ijkn} = y_{knij}$ (2 complementaries) for all $(i, j, k, n)$ with $i < k$ and $j \neq n$ (applying RLT1); (ii) replace each product $x_{ij}x_{kn}x_{pq}$, with $i \neq k \neq p$ and $j \neq n \neq q$, by the continuous variable $z_{ijknpq}$, imposing the constraints $z_{ijknpq} = z_{ijpqkn} = z_{knijpq} = z_{knpqij} = z_{pqijkn} = z_{pqknij}$ (6 complementaries) for all $(i, j, k, n, p, q)$ with $i < k < p$ and $j \neq n \neq q$ (applying RLT2); (iii) replace each product $x_{ij}x_{kn}x_{pq}x_{gh}$, with $i \neq k \neq p \neq g$ and $j \neq n \neq q \neq h$, by the continuous variable $v_{ijknpqgh}$, imposing the constraints $v_{ijknpqgh} = v_{ijknghpq} = \ldots = v_{ghpqknij}$ (24 complementaries) for all $(i, j, k, n, p, q, g, h)$ with $i < k < p < g$ and $j \neq n \neq q \neq h$ (applying RLT3).

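At the end of the RLT3 reformulation, we achieve the following objective function:

$$\min \Bigg\{ \sum_{i=1}^{N} \sum_{j=1}^{N} B_{ij} x_{ij}
+ \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{\substack{k=1 \\ k \neq i}}^{N} \sum_{\substack{n=1 \\ n \neq j}}^{N} C_{ijkn} y_{ijkn}
+ \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{\substack{k=1 \\ k \neq i}}^{N} \sum_{\substack{n=1 \\ n \neq j}}^{N} \sum_{\substack{p=1 \\ p \neq i,k}}^{N} \sum_{\substack{q=1 \\ q \neq j,n}}^{N} D_{ijknpq} z_{ijknpq}$$
$$\quad + \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{\substack{k=1 \\ k \neq i}}^{N} \sum_{\substack{n=1 \\ n \neq j}}^{N} \sum_{\substack{p=1 \\ p \neq i,k}}^{N} \sum_{\substack{q=1 \\ q \neq j,n}}^{N} \sum_{\substack{g=1 \\ g \neq i,k,p}}^{N} \sum_{\substack{h=1 \\ h \neq j,n,q}}^{N} E_{ijknpqgh} v_{ijknpqgh} + LB \Bigg\} \qquad (2)$$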
In the objective function (2), consider the constant term $LB = 0$, each coefficient $B_{ij} = 0$ $\forall (i, j)$, each coefficient $C_{ijkn} = f_{ik} \times d_{jn}$ $\forall (i, j, k, n)$ with $i \neq k$ and $j \neq n$, each coefficient $D_{ijknpq} = 0$ $\forall (i, j, k, n, p, q)$ with $i \neq k \neq p$ and $j \neq n \neq q$, and each coefficient $E_{ijknpqgh} = 0$ $\forall (i, j, k, n, p, q, g, h)$ with $i \neq k \neq p \neq g$ and $j \neq n \neq q \neq h$.



The dual ascent algorithm proposed in Hahn et al. (2012) consists of updating the constant term $LB$ and the cost matrices $B$, $C$, $D$ and $E$ in such a way that the cost of any (integer) feasible solution with respect to the modified objective function remains unchanged, while maintaining nonnegative coefficients. As a consequence of this property, the value of $LB$ at any moment of the execution is a valid lower bound on the optimal solution cost for the QAP. In the light of these aspects, the following procedures are developed:

I. Cost spreading: consists of the cost distributions from matrix $B$ to $C$, from matrix $C$ to $D$ and from matrix $D$ to $E$. In the cost spreading procedure from matrix $B$ to $C$, for each $(i, j)$, the coefficient $B_{ij}$ is spread through $(N - 1)$ rows of matrix $C$, i.e., each element $C_{ijkn}$ is increased by $B_{ij} / (N - 1)$, $\forall k \neq i$ and $n \neq j$. After such updating, $B_{ij}$ is updated to 0 for each $(i, j)$. The same procedure is repeated from matrix $C$ to $D$, where each coefficient $C_{ijkn}$ is spread through $(N - 2)$ rows of matrix $D$, and from matrix $D$ to $E$, where each coefficient $D_{ijknpq}$ is spread through $(N - 3)$ rows of matrix $E$.



II. Cost concentration: in this procedure we used the Hungarian Algorithm, Munkres (1957), to concentrate the costs from matrix $E$ to $D$, from matrix $D$ to $C$, from matrix $C$ to $B$ and from matrix $B$ to $LB$. The cost concentrations from matrix $E$ to $D$ are represented as $D_{ijknpq} \leftarrow Hungarian(E_{ijknpq})$. This procedure uses a matrix $M$ with size $(N - 3)^2$ to receive the $(N - 3)^2$ coefficients of the submatrix $E_{ijknpq}$: for each $(r, s = 1, \ldots, N - 3)$, $M_{rs}$ receives $E_{ijknpqgh}$, where $g$ ($h$) is the $r$-th row ($s$-th column) different from $i, k, p$ ($j, n, q$) in the submatrix $E_{ijknpq}$. Then, the Hungarian algorithm is applied to $M$ to obtain the total cost to be added to $D_{ijknpq}$, and the coefficients of the submatrix $E_{ijknpq}$ are replaced by the corresponding residual coefficients from $M$. The same procedure is repeated as $C_{ijkn} \leftarrow Hungarian(D_{ijkn})$, $B_{ij} \leftarrow Hungarian(C_{ij})$ and $LB \leftarrow Hungarian(B)$. In these procedures, the sizes of $M$ are $(N - 2)^2$, $(N - 1)^2$ and $N^2$, respectively.
III. Cost transfer between complementary coefficients: Differently from Hahn et al. (2012), the cost transfers always replace each coefficient by the arithmetic mean of all its complementaries. It is applied as follows: (i) In the matrix $C$, for each $(i, j, k, n)$, $C_{ijkn} \leftarrow C_{knij} \leftarrow (C_{ijkn} + C_{knij}) / 2$, with $i < k$ and $j \neq n$; (ii) In the matrix $D$, for each $(i, j, k, n, p, q)$, $D_{ijknpq} \leftarrow D_{ijpqkn} \leftarrow D_{knijpq} \leftarrow D_{knpqij} \leftarrow D_{pqijkn} \leftarrow D_{pqknij} \leftarrow (D_{ijknpq} + D_{ijpqkn} + D_{knijpq} + D_{knpqij} + D_{pqijkn} + D_{pqknij}) / 6$, with $i < k < p$ and $j \neq n \neq q$; (iii) In the matrix $E$, for each $(i, j, k, n, p, q, g, h)$, $E_{ijknpqgh} \leftarrow E_{ijknghpq} \leftarrow \ldots \leftarrow E_{ghpqknij} \leftarrow (E_{ijknpqgh} + E_{ijknghpq} + \ldots + E_{ghpqknij}) / 24$, with $i < k < p < g$ and $j \neq n \neq q \neq h$.
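The cost concentration of procedure II can be sketched on a single square matrix $M$. The code below is a minimal illustration, not the authors' implementation: it uses a standard shortest-augmenting-path variant of the Hungarian method with row and column potentials, and the name `hungarian_concentrate` is hypothetical. It returns the amount that would be moved to the upper level (for example, to $LB$ when $M = B$) together with the nonnegative residual coefficients left in $M$.

```python
def hungarian_concentrate(m):
    """Return (total, residual) for a square cost matrix m.

    total    : optimal assignment cost, i.e. the cost concentrated upward.
    residual : m[i][j] - u[i] - v[j], nonnegative, with a zero-cost assignment.
    """
    n = len(m)
    inf = float("inf")
    u = [0] * (n + 1)    # row potentials (1-based)
    v = [0] * (n + 1)    # column potentials (1-based, index 0 is a dummy)
    p = [0] * (n + 1)    # p[j] = row currently matched to column j
    way = [0] * (n + 1)  # predecessor column on the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (n + 1)
        used = [False] * (n + 1)
        while True:
            used[j0] = True
            i0 = p[j0]
            delta = inf
            j1 = 0
            for j in range(1, n + 1):
                if not used[j]:
                    cur = m[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j] = cur
                        way[j] = j0
                    if minv[j] < delta:
                        delta = minv[j]
                        j1 = j
            for j in range(n + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while True:  # augment along the path found above
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
            if j0 == 0:
                break
    total = sum(u[1:]) + sum(v[1:])
    residual = [[m[i][j] - u[i + 1] - v[j + 1] for j in range(n)]
                for i in range(n)]
    return total, residual


if __name__ == "__main__":
    lb, b_residual = hungarian_concentrate([[4, 1, 3], [2, 0, 5], [3, 2, 2]])
    print(lb, b_residual)
```

Because every assignment selects exactly one entry per row and per column, the cost of any assignment in the original matrix equals `total` plus its cost in `residual`, which is the invariance the dual ascent relies on.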



3. Distributed Algorithm 

In our distributed version, consider $T$ the set of hosts running the application, and let $R_t$ ($R_t \in T$) be the identification of a host. Let $f_{ik}$ and $d_{jn}$ be flow and distance matrices respectively, according to equation (1), $LB$, the lower bound, and $B$, $C$, $D$ and $E$, the matrices presented in the objective function (2). Consider $G_{ij}$ as a set composed of submatrices $B$, $C$, $D$, $E$ with the same $(i, j)$ stored and processed on $R_t$. Sets of $G$ are evenly distributed among the hosts. See Figure 1 for an example with twenty hosts, running an instance of
$N = 20$. In this figure, the set $G_{18,j}$, composed of submatrices $B_{18,j}$, $C_{18,j,k,n}$, $D_{18,j,k,n,p,q}$ and $E_{18,j,k,n,p,q,g,h}$, is stored and processed on a single host. Other forms of mapping can be accomplished, since $G_{ij}$ is used as a load distribution unit.
Figure 1: Example of allocation of sets $G_{ij}$ on 20 hosts
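One possible even mapping of the sets $G_{ij}$ to hosts is sketched below. The contiguous block rule and the name `assign_sets_to_hosts` are assumptions made for illustration; the text only states that the sets are evenly distributed and that other mappings are possible.

```python
def assign_sets_to_hosts(n, num_hosts):
    """Map each set G_ij (0-based i, j) to a host id in range(num_hosts).

    Assumed rule: the n*n sets are numbered row by row and cut into
    num_hosts contiguous blocks whose sizes differ by at most one.
    """
    total = n * n
    return {
        (i, j): (i * n + j) * num_hosts // total
        for i in range(n)
        for j in range(n)
    }


if __name__ == "__main__":
    mapping = assign_sets_to_hosts(20, 20)
    # With N = 20 and 20 hosts, this rule places every G_ij of row i on host i.
    print(mapping[(18, 0)], mapping[(18, 19)])
```

With this rule, complementaries such as $C_{ijkn}$ and $C_{knij}$ fall on different hosts whenever $i$ and $k$ belong to different blocks, which is the situation that triggers the message exchanges described in the rest of this section.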

The RLT3 algorithm applied to the QAP requires a large amount of RAM to store the coefficients of the matrices. An instance with $N = 30$, for example, requires around 1.6 TByte to store the matrix $E$, which is composed of $N^2 \times (N - 1)^2 \times (N - 2)^2 \times (N - 3)^2$ elements, each one holding an integer or float value (4 bytes). Although some improvements have been proposed in Hahn et al. (2012), the required memory remains much larger than that provided by modern computers.
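The memory figure above can be checked with a few lines of arithmetic. The helper names are hypothetical; the element count and the 4 bytes per element come from the text.

```python
def matrix_e_elements(n):
    """Number of coefficients of E: N^2 (N-1)^2 (N-2)^2 (N-3)^2."""
    return (n * (n - 1) * (n - 2) * (n - 3)) ** 2


def matrix_e_bytes(n, bytes_per_element=4):
    """Storage needed for E when every coefficient is stored explicitly."""
    return matrix_e_elements(n) * bytes_per_element


if __name__ == "__main__":
    size = matrix_e_bytes(30)
    print(size, size / 2**40)  # bytes, and the same amount in units of 2^40 bytes
```

For $N = 30$ this gives 432,595,598,400 coefficients, that is about $1.73 \times 10^{12}$ bytes, or roughly 1.57 when measured in units of $2^{40}$ bytes, in line with the figure of around 1.6 TByte.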

In the distributed algorithm, complementaries belonging to different sets can be allocated on different hosts, requiring that hosts communicate among themselves during their executions. The distributed algorithm runs several iterations and at each of them, four steps are executed. In the first one, complementary costs of matrix $E$ are exchanged. Complementary costs stored in $R_x$, needed in $R_z$, are transferred through messages from $R_x$ to $R_z$, denoted as $Comp(E)_{xz}$. In the next two steps, complementary costs of matrices $D$ and $C$ are sent through $Comp(D)_{xz}$ and $Comp(C)_{xz}$ messages, respectively. In the final stage, matrices $B$ are transmitted through $Mens(B)_{xz}$ messages. In small instances, up to $N = 12$, communication overhead does not impact the performance negatively. However, in bigger instances it becomes significant: with $N = 30$, the communication of complementary costs of matrix $E$ can represent up to 70% of the total execution time.

The steps of the distributed algorithm executed in the process $R_t$ are described next.

1 - Initialization: $LB \leftarrow 0$, $B_{ij} \leftarrow 0$ $\forall (i, j)$, $C_{ijkn} \leftarrow f_{ik} \times d_{jn}$ $\forall (i, j, k, n)$ with $i \neq k$ and $j \neq n$, $D_{ijknpq} \leftarrow 0$ $\forall (i, j, k, n, p, q)$ with $i \neq k \neq p$ and $j \neq n \neq q$, $E_{ijknpqgh} \leftarrow 0$ $\forall (i, j, k, n, p, q, g, h)$ with $i \neq k \neq p \neq g$ and $j \neq n \neq q \neq h$, $cont \leftarrow 1$, $lim \leftarrow$ total of iterations and $optimal \leftarrow$ optimal solution or best known solution cost.

2 - Transferring complementaries of matrix $C$: For each $R_s \in T$ and $R_s \neq R_t$, and for each $(i, j, k, n)$ | $G_{ij}$ allocated in $R_t$ and $G_{kn}$ allocated in $R_s$, $i < k$ and $j \neq n$, store coefficients $C_{ijkn}$ in $Comp(C)_{ts}$ $\forall i < k$ and $j \neq n$. Send $Comp(C)_{ts}$ to $R_s$. Upon receiving messages from other hosts, for each $G_{ij}$ allocated in $R_t$, $C_{ijkn} \leftarrow (C_{ijkn} + C_{knij}) / 2$.

3 - Cost concentration from matrix $C$ to matrix $B$: For each $G_{ij}$ allocated in $R_t$, concentrate the coefficients from matrix $C$ to $B$, by executing the Hungarian Algorithm, $B_{ij} \leftarrow Hungarian(C_{ij})$.

4 - Transferring matrix $B$: For each $(i, j)$ | $G_{ij}$ allocated in $R_t$, store coefficients $B_{ij}$ in $Mens(B)$. Broadcast $Mens(B)$ to all hosts. After receiving messages from all other hosts, update local matrix $B$.

5 - Cost concentration from matrix $B$ to $LB$: $LB \leftarrow Hungarian(B)$.

6 - Loop: Repeat until $cont = lim$ or $LB = optimal$. The loop termination condition is achieved when the total number of iterations reaches the previously defined limit ($cont = lim$) or the optimal solution is equal to the current lower bound ($LB = optimal$).

7 - Cost spreading from matrix $B$ to $C$: For each $(i, j)$ | $G_{ij}$ allocated in $R_t$, spread $B_{ij}$ through $(N - 1)$ submatrix rows of $C_{ij}$. Each cost element $C_{ijkn}$ is increased by $B_{ij} / (N - 1)$ $\forall k \neq i$ and $n \neq j$.

8 - Cost spreading from matrix $C$ to $D$: For each $(i, j, k, n)$ | $G_{ij}$ allocated in $R_t$ and $i \neq k$ and $j \neq n$, spread $C_{ijkn}$ through $(N - 2)$ submatrix rows of $D_{ijkn}$. Each cost element $D_{ijknpq}$ is increased by $C_{ijkn} / (N - 2)$ $\forall p \neq i, k$ and $q \neq j, n$.

9 - Cost spreading from matrix $D$ to $E$: For each $(i, j, k, n, p, q)$ | $G_{ij}$ allocated in $R_t$ and $i \neq k, p$ and $j \neq n, q$, spread $D_{ijknpq}$ through $(N - 3)$ submatrix rows of $E_{ijknpq}$. Each cost element $E_{ijknpqgh}$ is increased by $D_{ijknpq} / (N - 3)$ $\forall g \neq i, k, p$ and $h \neq j, n, q$.

10 - Cost transfer between complementary coefficients of matrix $E$: For each $R_s \in T$ and $R_s \neq R_t$, for each $(i, j, k, n, p, q, g, h)$ | $G_{ij}$ allocated in $R_t$ and ($G_{kn}$, $G_{pq}$ or $G_{gh}$) allocated in $R_s$ and $i < k < p < g$ and $j \neq n \neq q \neq h$, include the coefficients $E_{ijknpqgh}$ in $Comp(E)_{ts}$. Send message containing $Comp(E)_{ts}$. Upon receiving messages from all hosts, for each $(i, j, k, n, p, q, g, h)$ | $G_{ij}$ allocated in $R_t$, $E_{ijknpqgh} \leftarrow E_{ijknghpq} \leftarrow E_{ijpqkngh} \leftarrow E_{ijpqghkn} \leftarrow E_{ijghknpq} \leftarrow E_{ijghpqkn} \leftarrow (E_{ijknpqgh} + E_{ijknghpq} + \ldots + E_{ghpqknij}) / 24$.

11 - Cost concentration from matrix $E$ to $D$: For each $(i, j, k, n, p, q)$ | $G_{ij}$ allocated in $R_t$, concentrate the submatrices from $E$ to $D$, i.e., $D_{ijknpq} \leftarrow Hungarian(E_{ijknpq})$.

12 - Cost transfer between complementary coefficients of matrix $D$: For each $R_s \in T$ and $R_s \neq R_t$, for each $(i, j, k, n, p, q)$ | $G_{ij}$ allocated in $R_t$ and ($G_{kn}$ or $G_{pq}$) allocated in $R_s$ and $i < k < p$ and $j \neq n \neq q$, include the coefficients $D_{ijknpq}$ in $Comp(D)_{ts}$. Send message containing $Comp(D)_{ts}$. Upon receiving messages from all hosts, for each $(i, j, k, n, p, q)$ | $G_{ij}$ allocated in $R_t$, $D_{ijknpq} \leftarrow D_{ijpqkn} \leftarrow (D_{ijknpq} + D_{ijpqkn} + D_{knijpq} + D_{knpqij} + D_{pqijkn} + D_{pqknij}) / 6$.

13 - Cost concentration from matrix $D$ to $C$: For each $(i, j, k, n)$ | $G_{ij}$ allocated in $R_t$, concentrate the submatrices from $D$ to $C$, i.e., $C_{ijkn} \leftarrow Hungarian(D_{ijkn})$.

14, 15, 16, and 17 - These steps are identical to Steps 2, 3, 4, and 5, respectively.

18 - Loop end: Increase the variable $cont$ and return to Step 6.
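The property that makes these steps valid, namely that spreading (Step 7) and averaging of complementaries (Step 2) leave the cost of every feasible assignment unchanged, can be checked on a toy instance. The sketch below is sequential and restricted to matrices $B$ and $C$; all function names, the 3-object data and the nonzero starting values of $B$ are invented for illustration.

```python
from itertools import permutations


def build_c(flow, dist):
    """C[i][j][k][m] = f_ik * d_jm for k != i and m != j, else 0."""
    n = len(flow)
    return [[[[float(flow[i][k] * dist[j][m]) if k != i and m != j else 0.0
               for m in range(n)] for k in range(n)]
             for j in range(n)] for i in range(n)]


def solution_cost(lb, b, c, perm):
    """Objective (2) restricted to B and C, for x given as a permutation."""
    n = len(perm)
    linear = sum(b[i][perm[i]] for i in range(n))
    quadratic = sum(c[i][perm[i]][k][perm[k]]
                    for i in range(n) for k in range(n) if k != i)
    return lb + linear + quadratic


def spread_b_to_c(b, c):
    """Step 7: add B_ij / (N - 1) to each C_ijkm (k != i, m != j), then zero B_ij."""
    n = len(b)
    for i in range(n):
        for j in range(n):
            share = b[i][j] / (n - 1)
            for k in range(n):
                for m in range(n):
                    if k != i and m != j:
                        c[i][j][k][m] += share
            b[i][j] = 0.0


def average_complementaries(c):
    """Step 2 without messages: C_ijkm and C_kmij both become their mean."""
    n = len(c)
    for i in range(n):
        for k in range(i + 1, n):
            for j in range(n):
                for m in range(n):
                    if m != j:
                        mean = (c[i][j][k][m] + c[k][m][i][j]) / 2
                        c[i][j][k][m] = c[k][m][i][j] = mean


# Hypothetical data: asymmetric flows so that complementaries differ.
FLOW = [[0, 5, 2], [1, 0, 3], [4, 6, 0]]
DIST = [[0, 1, 4], [1, 0, 2], [4, 2, 0]]

if __name__ == "__main__":
    b = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
    c = build_c(FLOW, DIST)
    before = [solution_cost(0.0, b, c, p) for p in permutations(range(3))]
    spread_b_to_c(b, c)
    average_complementaries(c)
    after = [solution_cost(0.0, b, c, p) for p in permutations(range(3))]
    print(max(abs(x - y) for x, y in zip(before, after)))
```

In the distributed setting the same averaging is applied after the $Comp(C)_{ts}$ messages deliver the coefficients stored on other hosts; the arithmetic itself is unchanged.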

Compared to the sequential version, the following modifications have been applied 
in the distributed algorithm: (i) use of floating point numbers instead of integers for cost 
coefficients; (ii) use of arithmetic means to transfer costs among complementary coeffi- 
cients; (iii) execution of all cost transfers among complementary coefficients before con- 
centration; and (iv) never spreading from LB to matrix B. 

Of all these differences, the most important one is that of item (ii). In the sequential dual ascent algorithm proposed in Hahn et al. (2012), cost transfers are performed with the aim of increasing all cost coefficients of the current submatrix $M$, by pushing residual cost from its complementaries, before applying the cost concentration in that matrix. This approach imposes a sequential handling of submatrices at the same RLT level. Taking arithmetic means allows such matrices to be processed in parallel, but prevents residual costs resulting from the Hungarian algorithm from being used in other matrices at the same RLT level in the same iteration. This reuse of costs is not possible because all costs are evenly distributed among all complementaries before all cost concentrations are performed at that level. Initially, we expected that such a modification would significantly slow down the convergence of the lower bound and/or substantially reduce its quality, but the experiments reported in the next section show that neither effect is observed. In fact, we obtained better lower bounds in some cases.

4. Experimental Results 



Table 1: Comparison between the newly proposed distributed algorithm and other techniques

                                                 RLT3 distributed version
Instance  Optimal   BV04     HH01     HZ07       LB        gap      time(s)  Speedup  hosts  iterations
had14     2724      0.00%*   -        -          2724      0.00%*   559      1.62     4      29
had16     3720      0.13%    0.00%*   0.02%      3720      0.00%*   744      5.83     8      22
had18     5358      0.11%    0.00%*   0.02%      5358      0.00%*   5456     5.27     9      59
had20     6922      0.16%    0.00%*   0.03%      6922      0.00%*   16118    NA       16     109
kra30a    88900     2.50%    2.98%    -          88424     0.54%    196835   NA       90     162
nug12     578       1.73%    0.00%*   0.14%      578       0.00%*   73       2.75     4      16
nug15     1150      0.78%    0.00%*   0.08%      1150      0.00%*   360      5.28     9      22
nug16a    1610      0.75%    -        -          1610      0.00%*   1132     5.73     8      34
nug16b    1240      1.69%    -        -          1240      0.00%*   1294     5.71     8      39
nug18     1930      1.92%    -        0.00%*     1930      0.00%*   7172     5.36     9      78
nug20     2570      2.49%    2.41%    0.14%      2570      0.00%*   30129    NA       20     249
nug22     3596      2.34%    2.36%    0.08%      3596      0.00%*   41616    NA       22     157
nug24     3488      2.61%    -        -          3478      0.28%    173520   NA       24     300
nug25     3744      3.29%    -        -          3689      1.44%    172020   NA       25     211
nug28     5166      2.92%    -        -          5038      2.48%    171783   NA       49     118
nug30     6124      3.10%    5.78%    -          5940      3.00%    229583   NA       100    119
rou15     354210    1.13%    0.00%*   0.00%*     354210    0.00%*   323      5.78     9      20
rou20     725520    4.19%    3.60%    0.03%      720137    0.74%    37079    NA       25     300
tai15a    388214    2.86%    -        -          388214    0.00%*   737      6.18     9      46
tai17a    491812    3.11%    -        -          491812    0.00%*   1259     13.18    17     46
tai20a    703482    4.52%    3.93%    703482*    698271    0.74%    45720    NA       25     300
tai25a    1167256   4.66%    6.48%    -          1122200   3.87%    101170   NA       25     124
tai30a    1818146   6.12%    7.25%    -          1724510   5.15%    112085   NA       100    58
tho30     149936    4.75%    9.82%    -          142990    4.63%    145713   NA       100    79
chr18a    11098     0.00%*   -        -          11098     0.00%*   1892     5.32     9      20
chr20a    2192      0.18%    -        -          2192      0.00%*   5914     NA       16     39
chr20b    2298      0.13%    -        -          2298      0.00%*   3708     NA       16     24
chr22a    6156      0.03%    -        -          6156      0.00%*   5321     NA       22     20


The application was implemented using the programming language C++ with the Intel MPI library. The experiments were performed in the Netuno Cluster, see Silva et al. (2011), a cluster composed of 256 hosts, interconnected by InfiniBand. Each host consists of two Intel Xeon E5430 2.66GHz quad-core processors with 12MB of L2 cache and 16 GB of RAM per host.

A single process is executed per host, allowing it to use the total available memory without the resource contention usually caused by process concurrency. So, only one core per host is used to execute the application.

For evaluation of the proposed distributed algorithm, the application terminates when the optimal solution is found or when a total of 300 iterations is executed, respecting a time limit (usually about three days per instance) that varies according to the machine availability in the cluster.






Table 1 presents the results for different instances and sizes from the QAPLIB. In the first column of Table 1 there are the instance names and the corresponding dimensions. For example, nug20 represents an instance nug, from Nugent et al. (1968), with size $N = 20$. In the second column, there are the optimal values for each instance. The third column (BV04) contains the gaps obtained by the lift-and-project relaxation proposed in Burer & Vandenbussche (2006). In the fourth column (HH01), one finds the gaps obtained by the RLT2 based dual ascent algorithm proposed in Adams et al. (2007). In the fifth column (HZ07), there are the gaps obtained by the RLT3 based dual ascent algorithm proposed in Hahn et al. (2012). The results presented for the last two methods were obtained from the QAPLIB website, which does not contain values for all instances. In the sixth column, we show the lower bounds obtained in the RLT3 distributed version proposed in this paper. In the seventh column, we present the corresponding gaps, in the eighth column, the execution times in seconds, and in the last three columns, the speedups obtained via parallelism, the number of hosts used, and the number of iterations performed.

Also in Table 1, notice that the lower bounds that correspond to optimal solution costs or gaps that are zero are marked with an asterisk, and those which are the best known gaps are printed in bold. For some instances, it was not possible to execute the sequential version because of memory constraints; in those cases the calculation of speedups was not applicable, as indicated in the table (NA).
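The gap and speedup columns can be read with the two formulas sketched below. Both definitions are assumptions, since the text does not state them explicitly: the gap is taken as the relative distance between the lower bound and the optimal (or best known) cost, and the speedup as sequential time divided by distributed time. The function names are hypothetical.

```python
def gap_percent(optimal, lower_bound):
    """Assumed gap definition: 100 * (optimal - LB) / optimal."""
    return 100.0 * (optimal - lower_bound) / optimal


def speedup(sequential_seconds, distributed_seconds):
    """Assumed speedup definition: sequential time over distributed time."""
    return sequential_seconds / distributed_seconds


if __name__ == "__main__":
    # kra30a row of Table 1: optimal 88900, distributed RLT3 bound 88424.
    print(round(gap_percent(88900, 88424), 2))
```

Under this assumed definition, the kra30a and nug30 rows of Table 1 are reproduced to two decimals (0.54% and 3.00%).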

5. Conclusion 

The distributed version achieved good results compared with other proposals, reaching the best known bounds for 26 out of 28 instances, 18 of them being optimal solutions. The distributed algorithm allowed the execution of instances with size $N = 28$ and $N = 30$ for the first time using RLT3. Those good results were achieved due to the use of parallelism and the changes proposed to the original sequential code.